This IPython notebook illustrates how to perform blocking using Overlap blocker.
First, we need to import py_entitymatching package and other libraries as follows:
In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
import numpy as np
Then, read the (sample) input tables for blocking purposes.
In [2]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
# Get the paths of the input tables
path_A = datasets_dir + os.sep + 'person_table_A.csv'
path_B = datasets_dir + os.sep + 'person_table_B.csv'
In [3]:
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')
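As an optional sanity check, you can ask py_entitymatching's catalog which attribute was registered as the key for each table:
In [ ]:
# Verify that 'ID' was registered as the key attribute for both tables
em.get_key(A), em.get_key(B)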
In [4]:
A.head()
Out[4]:
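You can preview table B in the same way:
In [ ]:
B.head()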
There are three different ways to do overlap blocking:
1. Block two tables to produce a candidate set of tuple pairs.
2. Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.
3. Block two tuples to check if the tuple pair would get blocked.
In [5]:
# Instantiate overlap blocker object
ob = em.OverlapBlocker()
For the given two tables, we will assume that two persons whose addresses do not have sufficient overlap cannot refer to the same real-world person. So, we apply overlap blocking on address. Specifically, we tokenize the addresses by word and keep a tuple pair only if the addresses share at least 3 tokens. That is, we block all tuple pairs that do not share at least 3 tokens in address.
In [6]:
# Specify the tokenization to be 'word' level and set overlap_size to be 3.
C1 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3,
l_output_attrs=['name', 'birth_year', 'address'],
r_output_attrs=['name', 'birth_year', 'address'],
show_progress=False)
In [ ]:
# Display first 5 tuple pairs in the candidate set.
C1.head()
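To get a sense of how much the blocker reduced the search space, you can compare the candidate set size to the size of the full Cartesian product of A and B (the exact counts depend on the sample tables):
In [ ]:
# Number of surviving candidate pairs vs. all possible pairs
print(len(C1), 'candidate pairs out of', len(A) * len(B), 'possible pairs')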
In the above, we used a word-level tokenizer. The Overlap blocker also supports q-gram based tokenization, which can be used as follows:
In [ ]:
# Set the word_level to be False and set the value of q (using q_val)
C2 = ob.block_tables(A, B, 'address', 'address', word_level=False, q_val=3, overlap_size=3,
l_output_attrs=['name', 'birth_year', 'address'],
r_output_attrs=['name', 'birth_year', 'address'],
show_progress=False)
In [ ]:
# Display first 5 tuple pairs
C2.head()
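If you want to see roughly what 3-gram tokenization looks like, you can tokenize a sample string directly with py_stringmatching (a dependency of py_entitymatching). Note that this is only an illustration; the blocker's internal tokenization may differ in details such as padding, and the sample string is arbitrary:
In [ ]:
# Illustrate 3-gram tokenization on an arbitrary sample string
import py_stringmatching as psm
qg3 = psm.QgramTokenizer(qval=3)
qg3.tokenize('600 Admiral Way')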
The Overlap blocker removes some stop words by default. You can avoid this by setting the rem_stop_words parameter to False.
In [ ]:
# Set the parameter to remove stop words to False
C3 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3, rem_stop_words=False,
l_output_attrs=['name', 'birth_year', 'address'],
r_output_attrs=['name', 'birth_year', 'address'],
show_progress=False)
In [ ]:
# Display first 5 tuple pairs
C3.head()
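Because stop words are kept for C3, its size can differ from C1, which was built with stop-word removal; a quick comparison (the counts depend on the sample data):
In [ ]:
# Compare candidate set sizes with and without stop-word removal
print('with stop-word removal:   ', len(C1))
print('without stop-word removal:', len(C3))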
You can check what stop words are getting removed like this:
In [ ]:
ob.stop_words
You can update this stop word list (with some domain specific stop words) and do the blocking.
In [ ]:
# Include 'francisco' as one of the stop words
ob.stop_words.append('francisco')
In [ ]:
ob.stop_words
In [ ]:
# Block again with the updated stop word list
C4 = ob.block_tables(A, B, 'address', 'address', word_level=True, overlap_size=3,
l_output_attrs=['name', 'birth_year', 'address'],
r_output_attrs=['name', 'birth_year', 'address'],
show_progress=False)
In [ ]:
C4.head()
If the input tuples have missing values in the blocking attribute, they are ignored by default. You can set allow_missing to True to include all possible tuple pairs involving missing values.
In [ ]:
# Introduce a missing value
A1 = em.read_csv_metadata(path_A, key='ID')
A1.loc[0, 'address'] = np.nan
In [ ]:
# Set allow_missing to True to include tuple pairs involving missing values
C5 = ob.block_tables(A1, B, 'address', 'address', word_level=True, overlap_size=3, allow_missing=True,
l_output_attrs=['name', 'birth_year', 'address'],
r_output_attrs=['name', 'birth_year', 'address'],
show_progress=False)
In [ ]:
len(C5)
In [ ]:
C5
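With allow_missing=True, the tuple whose address is missing cannot be safely compared, so it is paired with every tuple in B. You can verify this by counting the candidate pairs that involve it (this assumes the default ltable_ID / rtable_ID output columns produced by block_tables):
In [ ]:
# Count candidate pairs involving the A tuple with the missing address;
# with allow_missing=True this should equal len(B)
missing_id = A1.loc[0, 'ID']
len(C5[C5['ltable_ID'] == missing_id]), len(B)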
In [ ]:
# Instantiate the overlap blocker
ob = em.OverlapBlocker()
In the above, we see that the candidate set produced by blocking the input tables includes tuple pairs whose addresses share at least three tokens. In addition, we will assume that two persons whose names share no tokens cannot refer to the same person. So, we block the candidate set of tuple pairs on name. That is, we block all the tuple pairs whose names have no overlapping tokens.
In [ ]:
# Specify the tokenization to be 'word' level and set overlap_size to be 1.
C6 = ob.block_candset(C1, 'name', 'name', word_level=True, overlap_size=1, show_progress=False)
In [ ]:
C6
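Blocking the candidate set again can only remove pairs, so C6 should be no larger than C1 (exact counts depend on the sample data):
In [ ]:
# The second blocking pass can only shrink the candidate set
print(len(C1), '->', len(C6))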
In the above, we saw that word level tokenization was used to tokenize the names. You can also use q-gram tokenization like this:
In [ ]:
# Specify q-gram tokenization (q_val=3) and set overlap_size to be 1.
C7 = ob.block_candset(C1, 'name', 'name', word_level=False, q_val= 3, overlap_size=1, show_progress=False)
In [ ]:
C7.head()
As we saw with block_tables, you can include all possible tuple pairs involving missing values by using the allow_missing parameter when blocking the candidate set. Similarly, you can block the candidate set with an updated set of stop words.
In [ ]:
# Introduce a missing value in the name attribute
A1.loc[2, 'name'] = np.nan
In [ ]:
C8 = ob.block_candset(C5, 'name', 'name', word_level=True, overlap_size=1, allow_missing=True, show_progress=False)
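As before, you can inspect the size and contents of the resulting candidate set:
In [ ]:
len(C8)
In [ ]:
C8.head()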
We can apply overlap blocking to a single tuple pair to check if it would get blocked. For example, we can check whether the first tuple from A and the first tuple from B would get blocked if we block on address.
In [ ]:
# Display the first tuple from table A
A.loc[[0]]
In [ ]:
# Display the first tuple from table B
B.loc[[0]]
In [ ]:
# Instantiate the Overlap Blocker
ob = em.OverlapBlocker()
# Apply blocking to a tuple pair from the input tables on address and get the blocking status
status = ob.block_tuples(A.loc[0], B.loc[0], 'address', 'address', overlap_size=1)
# Print the blocking status
print(status)
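Here block_tuples returns a boolean: True means the pair would get blocked (dropped), and False means it would survive into the candidate set. A small sketch of how you might use the status:
In [ ]:
# Interpret the blocking status for this tuple pair
if status:
    print('The pair is blocked and will not appear in the candidate set')
else:
    print('The pair survives blocking')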